a.b.c. Are you working on a REGRESSION or CLASSIFICATION problem?
The given problem can be approached as either a REGRESSION or a CLASSIFICATION problem. I have decided to approach it as a classification problem by categorizing tracks with popularity <= 50 as 0 (Unpopular) and tracks with popularity > 50 as 1 (Popular) in a new column called track_popularity_bin. The goal is to build a classification model that classifies a track as 0 or 1 based on its characteristics.
d. Describe the major findings you have identified based on your analysis.
Tracks with high loudness tend to be more popular, and duration_ms also appears related to popularity.
e. Which inputs/features seem to influence the response/outcome the most?
The following variables are the final input variables I identified after EDA.
f. What supported your conclusion? Was it only through predictive models?
Predictive models, their coefficients, and cross-validation.
The following visualizations also helped identify potential input-to-output relationships.
g. Could EDA help identify similar trends/relationships?
h. Was clustering consistent with any conclusions from the predictive models?
i. What skills did you learn from going through this project?
j. What general skills can you take away from the project to apply to applications more specific to your area of interest?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.cluster import hierarchy
import statsmodels.formula.api as smf
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
sns.set_palette("colorblind")
songs_url = 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv'
df = pd.read_csv(songs_url)
df.shape
(32833, 23)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32833 entries, 0 to 32832
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   track_id                  32833 non-null  object
 1   track_name                32828 non-null  object
 2   track_artist              32828 non-null  object
 3   track_popularity          32833 non-null  int64
 4   track_album_id            32833 non-null  object
 5   track_album_name          32828 non-null  object
 6   track_album_release_date  32833 non-null  object
 7   playlist_name             32833 non-null  object
 8   playlist_id               32833 non-null  object
 9   playlist_genre            32833 non-null  object
 10  playlist_subgenre         32833 non-null  object
 11  danceability              32833 non-null  float64
 12  energy                    32833 non-null  float64
 13  key                       32833 non-null  int64
 14  loudness                  32833 non-null  float64
 15  mode                      32833 non-null  int64
 16  speechiness               32833 non-null  float64
 17  acousticness              32833 non-null  float64
 18  instrumentalness          32833 non-null  float64
 19  liveness                  32833 non-null  float64
 20  valence                   32833 non-null  float64
 21  tempo                     32833 non-null  float64
 22  duration_ms               32833 non-null  int64
dtypes: float64(9), int64(4), object(10)
memory usage: 5.8+ MB
df.dtypes
track_id                     object
track_name                   object
track_artist                 object
track_popularity              int64
track_album_id               object
track_album_name             object
track_album_release_date     object
playlist_name                object
playlist_id                  object
playlist_genre               object
playlist_subgenre            object
danceability                float64
energy                      float64
key                           int64
loudness                    float64
mode                          int64
speechiness                 float64
acousticness                float64
instrumentalness            float64
liveness                    float64
valence                     float64
tempo                       float64
duration_ms                   int64
dtype: object
df.isna().sum()
track_id                    0
track_name                  5
track_artist                5
track_popularity            0
track_album_id              0
track_album_name            5
track_album_release_date    0
playlist_name               0
playlist_id                 0
playlist_genre              0
playlist_subgenre           0
danceability                0
energy                      0
key                         0
loudness                    0
mode                        0
speechiness                 0
acousticness                0
instrumentalness            0
liveness                    0
valence                     0
tempo                       0
duration_ms                 0
dtype: int64
df.nunique()
track_id                    28356
track_name                  23449
track_artist                10692
track_popularity              101
track_album_id              22545
track_album_name            19743
track_album_release_date     4530
playlist_name                 449
playlist_id                   471
playlist_genre                  6
playlist_subgenre              24
danceability                  822
energy                        952
key                            12
loudness                    10222
mode                            2
speechiness                  1270
acousticness                 3731
instrumentalness             4729
liveness                     1624
valence                      1362
tempo                       17684
duration_ms                 19785
dtype: int64
# Create a target variable - track_popularity_bin (y)
df['track_popularity_bin'] = np.where( df.track_popularity > 50, 1, 0 )
df = df.astype({'track_popularity_bin': 'object'})
df.track_popularity_bin.value_counts(normalize=True)
track_popularity_bin
0    0.574757
1    0.425243
Name: proportion, dtype: float64
# Removing duplicates - More details in the EDA notebook
df.drop_duplicates(subset=['track_id'], keep='first', inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 28356 entries, 0 to 32832
Data columns (total 24 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   track_id                  28356 non-null  object
 1   track_name                28352 non-null  object
 2   track_artist              28352 non-null  object
 3   track_popularity          28356 non-null  int64
 4   track_album_id            28356 non-null  object
 5   track_album_name          28352 non-null  object
 6   track_album_release_date  28356 non-null  object
 7   playlist_name             28356 non-null  object
 8   playlist_id               28356 non-null  object
 9   playlist_genre            28356 non-null  object
 10  playlist_subgenre         28356 non-null  object
 11  danceability              28356 non-null  float64
 12  energy                    28356 non-null  float64
 13  key                       28356 non-null  int64
 14  loudness                  28356 non-null  float64
 15  mode                      28356 non-null  int64
 16  speechiness               28356 non-null  float64
 17  acousticness              28356 non-null  float64
 18  instrumentalness          28356 non-null  float64
 19  liveness                  28356 non-null  float64
 20  valence                   28356 non-null  float64
 21  tempo                     28356 non-null  float64
 22  duration_ms               28356 non-null  int64
 23  track_popularity_bin      28356 non-null  object
dtypes: float64(9), int64(4), object(11)
memory usage: 5.4+ MB
df.describe(include='object')
| track_id | track_name | track_artist | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | playlist_subgenre | track_popularity_bin | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 28356 | 28352 | 28352 | 28356 | 28352 | 28356 | 28356 | 28356 | 28356 | 28356 | 28356 |
| unique | 28356 | 23449 | 10692 | 22545 | 19743 | 4530 | 448 | 470 | 6 | 24 | 2 |
| top | 6f807x0ima9a1j3VPbc7VN | Breathe | Queen | 5L1xcowSxwzFUSJzvyMp48 | Greatest Hits | 2020-01-10 | Indie Poptimism | 72r6odw0Q3OWTCYMGA7Yiy | rap | southern hip hop | 0 |
| freq | 1 | 18 | 130 | 42 | 135 | 201 | 294 | 100 | 5401 | 1583 | 17850 |
❌ There are far too many unique values in the columns track_artist, playlist_name, track_album_name and track_name. This makes these columns not very useful for training a model, and it is not practical to visualize them. Note: regarding visualization, I confirmed with the instructor on the Coursera Discussion Forum that it is not necessary to show visualizations for categorical variables with far too many unique values.
❌ Identifier columns like track_id, track_album_id and playlist_id will also not be very useful
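The cardinality check above can also be automated instead of read off `df.nunique()`. A minimal sketch, where the helper name `high_cardinality_columns` and the threshold are illustrative and not part of the notebook:

```python
import pandas as pd

def high_cardinality_columns(frame: pd.DataFrame, threshold: int) -> list:
    """Return object-dtype columns with more unique values than the threshold."""
    return [c for c in frame.select_dtypes('object').columns
            if frame[c].nunique() > threshold]

# Tiny frame standing in for the Spotify data
toy = pd.DataFrame({
    'track_id': [f'id_{i}' for i in range(5)],               # unique per row
    'playlist_genre': ['pop', 'rap', 'pop', 'edm', 'rap'],   # low cardinality
})
print(high_cardinality_columns(toy, threshold=3))  # ['track_id']
```

On the real data, a threshold around a few hundred would flag exactly the columns dropped below.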
df.drop(['track_id','track_album_id','playlist_id','track_artist','playlist_name', 'track_name', 'track_album_name'],
inplace=True,
axis=1)
The given data presents us with the opportunity to create additional columns. These new columns may have an impact on the target variable.
In this case, I have used the column track_album_release_date to create two new columns release_year and release_month
df['track_album_release_date'] = pd.to_datetime(df['track_album_release_date'], format='mixed')
df['release_year'] = df.track_album_release_date.dt.year
sns.catplot(data = df, y='release_year', kind='count', height=10, aspect=2)
plt.show()
💡 As seen above, the dataset mostly consists of songs released in recent years rather than older ones. This suggests we can create one more variable that buckets release_year: songs released before 2010 in one bucket ('older') and songs from 2010 onward in another ('recent')
df['release_year_bin'] = np.where( df.release_year < 2010 , 'older', 'recent')
df['release_month'] = df.track_album_release_date.dt.month
sns.catplot(data = df, y='release_month', kind='count', height=8, aspect=1.5)
plt.show()
💡 The majority of songs in the given dataset were released in the month of January
Since track_album_release_date has been split into release_year and release_month, the original column can be dropped
df.drop(['track_album_release_date'], axis=1, inplace=True)
💡 Let's visualize two additional variables - key and mode
Although key and mode are numeric columns, they have few unique values and can therefore be treated as categorical variables
sns.catplot(data = df, y='key', kind='count', height=5, aspect=2)
sns.catplot(data = df, y='mode', kind='count', height=2, aspect=2)
df.describe()
| track_popularity | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | release_year | release_month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 28356.000000 | 28356.000000 | 28356.000000 | 28356.000000 | 28356.000000 | 28356.000000 | 28356.000000 | 28356.000000 | 28356.000000 | 28356.000000 | 28356.000000 | 28356.00000 | 28356.000000 | 28356.000000 | 28356.000000 |
| mean | 39.329771 | 0.653372 | 0.698388 | 5.368000 | -6.817696 | 0.565489 | 0.107954 | 0.177176 | 0.091117 | 0.190958 | 0.510387 | 120.95618 | 226575.967026 | 2011.054027 | 6.101813 |
| std | 23.702376 | 0.145785 | 0.183503 | 3.613904 | 3.036243 | 0.495701 | 0.102556 | 0.222803 | 0.232548 | 0.155894 | 0.234340 | 26.95456 | 61078.450819 | 11.229221 | 3.841027 |
| min | 0.000000 | 0.000000 | 0.000175 | 0.000000 | -46.448000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 4000.000000 | 1957.000000 | 1.000000 |
| 25% | 21.000000 | 0.561000 | 0.579000 | 2.000000 | -8.309250 | 0.000000 | 0.041000 | 0.014375 | 0.000000 | 0.092600 | 0.329000 | 99.97200 | 187742.000000 | 2008.000000 | 2.000000 |
| 50% | 42.000000 | 0.670000 | 0.722000 | 6.000000 | -6.261000 | 1.000000 | 0.062600 | 0.079700 | 0.000021 | 0.127000 | 0.512000 | 121.99300 | 216933.000000 | 2016.000000 | 6.000000 |
| 75% | 58.000000 | 0.760000 | 0.843000 | 9.000000 | -4.709000 | 1.000000 | 0.133000 | 0.260000 | 0.006570 | 0.249000 | 0.695000 | 133.99900 | 254975.250000 | 2019.000000 | 10.000000 |
| max | 100.000000 | 0.983000 | 1.000000 | 11.000000 | 1.275000 | 1.000000 | 0.918000 | 0.994000 | 0.994000 | 0.996000 | 0.991000 | 239.44000 | 517810.000000 | 2020.000000 | 12.000000 |
df_features = df.select_dtypes('number').copy()
df_features.drop(['track_popularity'], axis=1, inplace=True) # Dropping Target variable
df_objects = df.select_dtypes('object').copy()
id_cols = ['rowid', 'track_popularity'] + df_objects.columns.to_list()
df_lf = (df.reset_index()
           .rename(columns={'index': 'rowid'})
           .melt(id_vars=id_cols, value_vars=df_features.columns))
sns.displot(data = df_lf, x='value', col='variable', kind='hist', kde=True,
facet_kws={'sharex': False, 'sharey': False},
common_bins=False,
col_wrap=3)
plt.subplots_adjust(hspace=0.5)
plt.tight_layout()
plt.show()
We can see that speechiness, instrumentalness, liveness, acousticness and duration_ms are right-skewed.
loudness is left-skewed (mode is binary, so skewness is not meaningful for it).
Only danceability, energy, valence and tempo have normal / approximately normal distributions.
📌 Before we can use the data for modeling, we must transform the left- and right-skewed features so they have more symmetrical, bell-shaped distributions.
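A quick way to confirm that a transform actually reduces skew is to compare sample skewness before and after. A small sketch on synthetic right-skewed data (the lognormal toy data is illustrative, not the Spotify features):

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed data (a lognormal, shaped like duration_ms or liveness)
rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

print(f'skew before log: {skew(x):.2f}')          # strongly positive
print(f'skew after log:  {skew(np.log(x)):.2f}')  # near 0, roughly symmetric
```

For features that contain zeros, the log needs a guard (the notebook's later transforms map 0 to 0 for exactly this reason).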
sns.catplot( data = df, x='track_popularity_bin', hue='playlist_genre', kind='count' )
plt.show()
💡 The above DODGED BAR CHART of track_popularity_bin by playlist_genre shows that edm is the most unpopular genre, with far more unpopular than popular tracks. Other genres have approximately the same number of tracks in both categories. We can also see that pop and rap are the most popular genres
fig, ax = plt.subplots(figsize=(20,10))
sns.heatmap( pd.crosstab( df.playlist_subgenre, df.playlist_genre ), ax = ax,
annot=True, annot_kws={'size': 10}, fmt='d',
cbar=False)
plt.show()
📌 The above heatmap shows that playlist_subgenre is nested within playlist_genre: each subgenre maps to a single genre. Given this redundancy, playlist_subgenre can be dropped from the dataset, since highly associated variables add little value for the final model.
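The visual impression of a strong genre-subgenre association can also be quantified, for example with Cramér's V computed from the chi-square statistic of the crosstab. A sketch under stated assumptions: the `cramers_v` helper and the toy genre/subgenre data below are illustrative, not from the dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(a: pd.Series, b: pd.Series) -> float:
    """Cramér's V association between two categorical series (0 = none, 1 = perfect)."""
    table = pd.crosstab(a, b)
    chi2 = chi2_contingency(table)[0]          # chi-square statistic of the crosstab
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))

# Toy example: each subgenre maps to exactly one genre, so V should be 1.0
genre    = pd.Series(['rap', 'rap', 'pop', 'pop', 'edm', 'edm'])
subgenre = pd.Series(['trap', 'trap', 'dance pop', 'dance pop', 'house', 'house'])
print(round(cramers_v(genre, subgenre), 2))  # 1.0
```

Applied to the real playlist_genre/playlist_subgenre pair, a value near 1 would back the decision to drop the subgenre column.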
df.drop(columns=['playlist_subgenre'], inplace=True)
sns.catplot( data = df, x='key', y='track_popularity', kind='point', linestyle='none')
plt.show()
💡 Tracks with higher popularity tend to have higher value for key
sns.catplot( data = df, x='mode', y='track_popularity', kind='point', linestyle='none')
plt.show()
💡 Tracks with higher popularity tend to have a mode of 1
A correlation plot is a very effective way to view the relationships between continuous variables
Below is the corr plot for the given dataset
fig, ax = plt.subplots(figsize=(20,15))
sns.heatmap(data = df.select_dtypes('number').corr(numeric_only=True),
vmin=-1, vmax=1, center=0,
cmap='coolwarm', cbar=False,
annot=True, annot_kws={'size': 12},
ax=ax)
plt.tight_layout()
plt.show()
💡 The above plot reveals the following
- energy and loudness are highly positively correlated
- energy and acousticness are highly negatively correlated
- track_popularity has a low correlation with all other variables. This is good because we can use the other variables to "predict" track popularity

sns.relplot(data = df, x='duration_ms', y='loudness')
plt.show()
💡 Tracks that are longer tend to be less loud
sns.relplot(data = df, x='acousticness', y='loudness')
plt.show()
💡 Tracks that are more acoustic tend to be less loud than tracks that are less acoustic
sns.relplot(data = df, x='valence', y='danceability')
plt.show()
💡 Danceability of tracks increases with valence
Next, I visualize the conditional distributions of the continuous inputs grouped by the response variable track_popularity_bin
sns.displot(data = df_lf, x='value', col='variable', kind='kde',
hue='track_popularity_bin',
facet_kws={'sharex': False, 'sharey': False},
common_norm=False,
col_wrap=3
)
plt.show()
💡 Observations
- danceability of popular songs (track_popularity_bin=1) is higher than that of unpopular songs (track_popularity_bin=0)
- acousticness of popular songs is lower than that of unpopular songs

Here, leveraging a PAIR PLOT to visualize the relationships between continuous inputs GROUPED BY the unique values of the response (outcome)
🚨 Splitting into two pair plots because the notebook gets stuck otherwise
PAIR PLOT - 1
sns.pairplot(data=df[[ 'danceability', 'loudness', 'tempo',
'valence', 'acousticness',
'duration_ms', 'track_popularity_bin']],
hue='track_popularity_bin',
diag_kws={'common_norm': False})
plt.show()
💡 Observations
- danceability increases as tempo increases
- from the tempo and danceability score one can determine the popularity of a track

PAIR PLOT - 2
sns.pairplot(data=df[[ 'energy', 'key', 'mode',
'speechiness', 'instrumentalness',
'liveness', 'track_popularity_bin']],
hue='track_popularity_bin',
diag_kws={'common_norm': False})
plt.show()
💡 Observations
- Tracks with high speechiness tend to have high instrumentalness
- Tracks with high speechiness also tend to have high energy. This could be because of tracks from the rap genre that are speechy as well as highly energetic
- Tracks with high liveness tend to have high energy. This could be because songs that have high energy are more likely to be performed in front of a live audience
- Tracks with a mode of 1 tend to have a high key value

sns.catplot(data = df, x='playlist_genre', hue='track_popularity_bin', col='release_year_bin', kind='count', )
plt.show()
💡 Observations
- rock used to be the most popular genre in the older years and it has been replaced by pop and rap in the recent years
- edm is the most unpopular genre in the recent years

# Dropping Unused columns
df.drop(columns=['track_popularity', 'release_year', 'release_month'], inplace=True)
columns_to_use = ['danceability', 'energy', 'key', 'loudness', 'mode',
'speechiness', 'acousticness', 'instrumentalness',
'liveness', 'valence', 'tempo', 'duration_ms',
'track_popularity_bin', 'playlist_genre']
df_kmeans = df[columns_to_use]
df_features_clean = df_kmeans.select_dtypes('number').copy()
df_kmeans.isna().sum()
danceability            0
energy                  0
key                     0
loudness                0
mode                    0
speechiness             0
acousticness            0
instrumentalness        0
liveness                0
valence                 0
tempo                   0
duration_ms             0
track_popularity_bin    0
playlist_genre          0
dtype: int64
💡 There are no MISSING VALUES in the dataset
sns.catplot(data = df_features_clean, kind='box', aspect=2)
plt.show()
💡 Since one variable dominates, the data has to be standardized first to remove the MAGNITUDE and SCALE effects. KMeans judges SIMILARITY by DISTANCE, and distance depends on MAGNITUDE and SCALE
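A toy illustration of this point: with two features on very different scales (made-up stand-ins for a 0-1 audio feature versus duration_ms), the raw Euclidean distance is driven almost entirely by the large-magnitude column, while after standardization both features contribute comparably. The numbers below are fabricated for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on wildly different scales
X_toy = np.array([[0.5, 200_000.0],
                  [0.9, 200_100.0],
                  [0.5, 350_000.0]])

# Unscaled: the distance between rows 0 and 1 is dominated by the big column
d_unscaled = np.linalg.norm(X_toy[0] - X_toy[1])

# Scaled: both features now contribute on comparable footing
X_toy_scaled = StandardScaler().fit_transform(X_toy)
d_scaled = np.linalg.norm(X_toy_scaled[0] - X_toy_scaled[1])

print(f'unscaled distance row0-row1: {d_unscaled:,.1f}')  # ~100, almost all from column 2
print(f'scaled   distance row0-row1: {d_scaled:.2f}')     # the 0.5 vs 0.9 gap now matters
```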
X = StandardScaler().fit_transform(df_features_clean)
sns.catplot(data = pd.DataFrame(X, columns=df_features_clean.columns), kind='box', aspect=2)
plt.show()
📌 Variables have now been standardized
clusters_2 = KMeans(n_clusters=2, random_state=121, n_init=25, max_iter=500).fit_predict(X)
df_clean_copy = df_kmeans.copy()
df_clean_copy['k2'] = pd.Series(clusters_2, index=df_clean_copy.index ).astype('category')
df_clean_copy.k2.value_counts()
k2
0    20078
1     8278
Name: count, dtype: int64
fig, ax = plt.subplots(figsize=(20,5))
sns.heatmap(data = pd.crosstab(df_clean_copy.track_popularity_bin,
df_clean_copy.k2,
margins=True ),
annot=True,
annot_kws={"fontsize": 20},
fmt='g',
cbar=False,
ax=ax)
plt.show()
The above heatmap tells us that most songs have ended up in cluster 0. This suggests that more clusters are needed to find songs with distinctive characteristics that end up in the popular and unpopular categories
Next, I will find the optimal number of clusters using the KNEE BEND (elbow) PLOT!
tots_within = []
K = range(1, 15)
for k in K:
km = KMeans(n_clusters=k, random_state=121, n_init=25, max_iter=500)
km = km.fit(X)
tots_within.append( km.inertia_ )
fig, ax = plt.subplots()
ax.plot( K, tots_within, 'bo-' )
ax.set_xlabel('number of clusters')
ax.set_ylabel('total within sum of squares')
plt.show()
📌 Although there isn't a clean KNEE BEND here, we can see that the plot starts to bend around 5 clusters and flattens out before 8. So, I have decided to go with 7 clusters for further analysis
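When the elbow is ambiguous like this, the silhouette score offers a complementary check: compute it for each candidate k and prefer higher values. A hedged sketch on synthetic blobs (the `make_blobs` stand-in data and the 2-7 search range are illustrative, not the standardized Spotify features):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four clearly separated synthetic groups as a stand-in for the song features
centers = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X_blobs, _ = make_blobs(n_samples=600, centers=centers,
                        cluster_std=0.6, random_state=121)

# Silhouette score per candidate k: higher means tighter, better-separated clusters
sil_scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=121, n_init=10).fit_predict(X_blobs)
    sil_scores[k] = silhouette_score(X_blobs, labels)

best_k = max(sil_scores, key=sil_scores.get)
print(best_k)  # 4, matching the number of generating centers
```

On real, overlapping data the silhouette curve is flatter, but comparing its peak against the elbow region helps justify a choice like k=7.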
clusters_7 = KMeans(n_clusters=7, random_state=121, n_init=25, max_iter=500).fit_predict(X)
df_clean_copy['k7'] = pd.Series(clusters_7, index=df_clean_copy.index ).astype('category')
df_clean_copy.k7.value_counts()
k7
3    6844
4    5963
2    5038
6    3510
5    3123
0    2211
1    1667
Name: count, dtype: int64
fig, ax = plt.subplots(figsize=(20,5))
sns.heatmap(data = pd.crosstab(df_clean_copy.track_popularity_bin,
df_clean_copy.k7,
margins=True ),
annot=True,
annot_kws={"fontsize": 20},
fmt='g',
cbar=False,
ax=ax)
plt.show()
💡 The above heatmap shows that clusters 0, 1 and 2 tend to have more unpopular songs than clusters 3, 4, 5 and 6. This is a much better separation than with two clusters
Finally, using PAIR PLOT to visualize the relationships between continuous variables GROUPED BY the cluster category as well as conditional distribution of each continuous variable GROUPED BY the cluster category
# NOTE: I have used a sample of 5000 because my notebook kept crashing
sns.pairplot(data = df_clean_copy.sample(5000), hue='k7', diag_kws={'common_norm': False},
palette='viridis')
plt.show()
💡 Observations
- energy compared to the clusters 3,4,5,6
- energy and duration_ms between 200000 and 300000
- instrumentalness and high energy

hclust_ward = hierarchy.ward(X)
fig = plt.figure(figsize=(12, 6))
dn = hierarchy.dendrogram(hclust_ward, no_labels=True )
plt.show()
np.unique( hierarchy.cut_tree( hclust_ward, height=180).ravel() )
array([0, 1, 2, 3])
df_clean_copy['hclust_4'] = pd.Series( hierarchy.cut_tree( hclust_ward, height=180).ravel(),
index=df_clean_copy.index ).astype('object')
sns.catplot(data = df_clean_copy, x='hclust_4', kind='count')
plt.show()
Crosstab
fig, ax = plt.subplots()
sns.heatmap(
data = pd.crosstab(
df.track_popularity_bin, df_clean_copy.hclust_4, margins=True,
),
annot=True,
annot_kws={'fontsize': 20}, fmt='g',
cbar=False,
ax=ax)
plt.show()
fig, ax = plt.subplots()
sns.heatmap(
data = pd.crosstab(
df.track_popularity_bin, df_clean_copy.hclust_4, margins=True, normalize='columns'
)*100,
annot=True,
annot_kws={'fontsize': 10}, fmt='g',
cbar=False,
ax=ax)
plt.show()
💡 Cluster 2 has the highest percentage of unpopular songs
Visualize the distribution of individual features using BOXPLOTS across the different clusters
# Reshape to long format to enable exploring ALL of the original variables GROUPED BY the clusters!
df_clean_copy_lf = (df_clean_copy.reset_index()
                    .rename(columns={'index': 'rowid'})
                    .melt(id_vars=['rowid', 'k7', 'k2', 'hclust_4', 'track_popularity_bin'],
                          value_vars=['danceability', 'energy', 'key', 'loudness', 'mode',
                                      'speechiness', 'acousticness', 'instrumentalness',
                                      'liveness', 'valence', 'tempo', 'duration_ms']))
df_clean_copy_lf
| rowid | k7 | k2 | hclust_4 | track_popularity_bin | variable | value | |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 0 | 1 | danceability | 0.748 |
| 1 | 1 | 3 | 0 | 0 | 1 | danceability | 0.726 |
| 2 | 2 | 4 | 0 | 1 | 1 | danceability | 0.675 |
| 3 | 3 | 2 | 0 | 0 | 1 | danceability | 0.718 |
| 4 | 4 | 3 | 0 | 0 | 1 | danceability | 0.650 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 340267 | 32828 | 2 | 0 | 0 | 0 | duration_ms | 204375.000 |
| 340268 | 32829 | 2 | 0 | 0 | 0 | duration_ms | 353120.000 |
| 340269 | 32830 | 2 | 0 | 1 | 0 | duration_ms | 210112.000 |
| 340270 | 32831 | 2 | 0 | 0 | 0 | duration_ms | 367432.000 |
| 340271 | 32832 | 1 | 0 | 1 | 0 | duration_ms | 337500.000 |
340272 rows × 7 columns
sns.catplot(data = df_clean_copy_lf[['hclust_4','variable','value']],
x='hclust_4',
y='value',
col='variable',
col_wrap=3,
kind='box',
sharey=False)
plt.show()
💡 Observations
General Insights
Visualize relationships using PAIR PLOT
sns.pairplot(data=df_clean_copy[['danceability', 'energy', 'key', 'loudness', 'mode',
'speechiness', 'acousticness', 'instrumentalness',
'liveness', 'valence', 'tempo', 'duration_ms','hclust_4']],
hue='hclust_4',
diag_kws={'common_norm': False},
palette='viridis')
plt.show()
💡 Observations
# Reducing skewness by applying a log transform (guarding against log(0))
for col in ['speechiness', 'acousticness', 'instrumentalness', 'liveness', 'duration_ms']:
    df[f'{col}_log'] = df[col].apply(lambda x: np.log(x) if x > 0 else 0)
Define a function to FIT and CALCULATE the training set performance
def fit_and_assess_logistic(mod_name, a_formula, train_data, threshold):
a_mod = smf.logit(formula=a_formula, data=train_data).fit()
train_copy = train_data.copy()
train_copy['pred_probability'] = a_mod.predict(train_data)
train_copy['pred_class'] = np.where(train_copy.pred_probability > threshold, 1, 0 )
TN, FP, FN, TP = confusion_matrix(train_copy.y.to_numpy(), train_copy.pred_class.to_numpy() ).ravel()
Accuracy = (TN + TP) / (TN + FP + FN + TP)
Sensitivity = (TP) / (TP + FN)
Specificity = (TN) / (TN + FP)
FPR = 1 - Specificity
ROC_AUC = roc_auc_score(train_copy.y.to_numpy(), train_copy.pred_probability.to_numpy() )
res_dict = {'model_name': mod_name,
'model_formula': a_formula,
'num_coefs': len( a_mod.params ),
'threshold': threshold,
'Accuracy': Accuracy,
'Sensitivity': Sensitivity,
'Specificity': Specificity,
'FPR': FPR,
'ROC_AUC': ROC_AUC}
return pd.DataFrame( res_dict, index=[0] )
# Create a column `y` that is a copy of the target variable `track_popularity_bin`.
# This makes the formula interface and coefficient interpretation easier.
df['y'] = df['track_popularity_bin']
df = df.astype({'y': 'int'})
df = df.astype({'track_popularity_bin': 'int'})
fit_and_assess_logistic(0, 'y ~ 1', train_data=df, threshold=0.5)
Optimization terminated successfully.
Current function value: 0.659223
Iterations 4
| model_name | model_formula | num_coefs | threshold | Accuracy | Sensitivity | Specificity | FPR | ROC_AUC | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | y ~ 1 | 1 | 0.5 | 0.629496 | 0.0 | 1.0 | 0.0 | 0.5 |
formula_list = [
# Intercept Only
'y ~ 1',
# Categorical inputs with additive features
'y ~ playlist_genre + key + mode',
# Continuous inputs with linear additive features
'y ~ tempo + loudness + valence + danceability + energy + duration_ms_log + speechiness_log + instrumentalness_log + liveness_log + acousticness_log',
# All Inputs
'y ~ playlist_genre + key + mode + tempo + loudness + valence + danceability + energy + duration_ms_log + speechiness_log + instrumentalness_log + liveness_log + acousticness_log',
# Continuous inputs with linear main effect and pair-wise interactions.
'y ~ (tempo + loudness + valence + danceability + energy + duration_ms_log + speechiness_log + instrumentalness_log + liveness_log + acousticness_log) ** 2',
# Interact the categorical inputs with the continuous inputs. This model must include the linear main effects as well (the formula interface helps with the interactions).
'y ~ playlist_genre * (tempo + loudness + valence + danceability + energy + duration_ms_log + speechiness_log + instrumentalness_log + liveness_log + acousticness_log)',
# Based on EDA
'y ~ mode * (tempo + loudness + valence + danceability + energy + duration_ms_log + speechiness_log + instrumentalness_log + liveness_log + acousticness_log)',
'y ~ playlist_genre * ( (tempo + loudness + valence + danceability + energy + duration_ms_log + speechiness_log + instrumentalness_log + liveness_log + acousticness_log)**2 + np.power(tempo,2) + np.power(loudness,2) + np.power(valence,2) + np.power(danceability,2) + np.power(energy,2) + np.power(duration_ms_log,2) + np.power(speechiness_log,2) + np.power(instrumentalness_log,2) + np.power(liveness_log,2) + np.power(acousticness_log,2))',
]
results_list=[]
for x in range(len(formula_list)):
results_list.append(
fit_and_assess_logistic(x, formula_list[x], train_data = df, threshold=0.5) )
Optimization terminated successfully.
Current function value: 0.659223
Iterations 4
Optimization terminated successfully.
Current function value: 0.635441
Iterations 5
Optimization terminated successfully.
Current function value: 0.642106
Iterations 5
Optimization terminated successfully.
Current function value: 0.620351
Iterations 6
Optimization terminated successfully.
Current function value: 0.633988
Iterations 6
Optimization terminated successfully.
Current function value: 0.612935
Iterations 7
Optimization terminated successfully.
Current function value: 0.641314
Iterations 5
Optimization terminated successfully.
Current function value: 0.597308
Iterations 15
results_df = pd.concat(results_list, ignore_index=True)
results_df
| model_name | model_formula | num_coefs | threshold | Accuracy | Sensitivity | Specificity | FPR | ROC_AUC | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | y ~ 1 | 1 | 0.5 | 0.629496 | 0.000000 | 1.000000 | 0.000000 | 0.500000 |
| 1 | 1 | y ~ playlist_genre + key + mode | 8 | 0.5 | 0.629496 | 0.000000 | 1.000000 | 0.000000 | 0.614164 |
| 2 | 2 | y ~ tempo + loudness + valence + danceability ... | 11 | 0.5 | 0.631471 | 0.094993 | 0.947227 | 0.052773 | 0.611263 |
| 3 | 3 | y ~ playlist_genre + key + mode + tempo + loud... | 18 | 0.5 | 0.645013 | 0.246907 | 0.879328 | 0.120672 | 0.661755 |
| 4 | 4 | y ~ (tempo + loudness + valence + danceability... | 56 | 0.5 | 0.638489 | 0.138778 | 0.932605 | 0.067395 | 0.629157 |
| 5 | 5 | y ~ playlist_genre * (tempo + loudness + valen... | 66 | 0.5 | 0.657215 | 0.283267 | 0.877311 | 0.122689 | 0.677108 |
| 6 | 6 | y ~ mode * (tempo + loudness + valence + dance... | 22 | 0.5 | 0.632741 | 0.103560 | 0.944202 | 0.055798 | 0.613896 |
| 7 | 7 | y ~ playlist_genre * ( (tempo + loudness + val... | 396 | 0.5 | 0.669453 | 0.360937 | 0.851036 | 0.148964 | 0.701659 |
results_df.sort_values(by=['Accuracy'], ascending=False)
| model_name | model_formula | num_coefs | threshold | Accuracy | Sensitivity | Specificity | FPR | ROC_AUC | |
|---|---|---|---|---|---|---|---|---|---|
| 7 | 7 | y ~ playlist_genre * ( (tempo + loudness + val... | 396 | 0.5 | 0.669453 | 0.360937 | 0.851036 | 0.148964 | 0.701659 |
| 5 | 5 | y ~ playlist_genre * (tempo + loudness + valen... | 66 | 0.5 | 0.657215 | 0.283267 | 0.877311 | 0.122689 | 0.677108 |
| 3 | 3 | y ~ playlist_genre + key + mode + tempo + loud... | 18 | 0.5 | 0.645013 | 0.246907 | 0.879328 | 0.120672 | 0.661755 |
| 4 | 4 | y ~ (tempo + loudness + valence + danceability... | 56 | 0.5 | 0.638489 | 0.138778 | 0.932605 | 0.067395 | 0.629157 |
| 6 | 6 | y ~ mode * (tempo + loudness + valence + dance... | 22 | 0.5 | 0.632741 | 0.103560 | 0.944202 | 0.055798 | 0.613896 |
| 2 | 2 | y ~ tempo + loudness + valence + danceability ... | 11 | 0.5 | 0.631471 | 0.094993 | 0.947227 | 0.052773 | 0.611263 |
| 0 | 0 | y ~ 1 | 1 | 0.5 | 0.629496 | 0.000000 | 1.000000 | 0.000000 | 0.500000 |
| 1 | 1 | y ~ playlist_genre + key + mode | 8 | 0.5 | 0.629496 | 0.000000 | 1.000000 | 0.000000 | 0.614164 |
results_df.sort_values(by=['ROC_AUC'], ascending=False)
| model_name | model_formula | num_coefs | threshold | Accuracy | Sensitivity | Specificity | FPR | ROC_AUC | |
|---|---|---|---|---|---|---|---|---|---|
| 7 | 7 | y ~ playlist_genre * ( (tempo + loudness + val... | 396 | 0.5 | 0.669453 | 0.360937 | 0.851036 | 0.148964 | 0.701659 |
| 5 | 5 | y ~ playlist_genre * (tempo + loudness + valen... | 66 | 0.5 | 0.657215 | 0.283267 | 0.877311 | 0.122689 | 0.677108 |
| 3 | 3 | y ~ playlist_genre + key + mode + tempo + loud... | 18 | 0.5 | 0.645013 | 0.246907 | 0.879328 | 0.120672 | 0.661755 |
| 4 | 4 | y ~ (tempo + loudness + valence + danceability... | 56 | 0.5 | 0.638489 | 0.138778 | 0.932605 | 0.067395 | 0.629157 |
| 1 | 1 | y ~ playlist_genre + key + mode | 8 | 0.5 | 0.629496 | 0.000000 | 1.000000 | 0.000000 | 0.614164 |
| 6 | 6 | y ~ mode * (tempo + loudness + valence + dance... | 22 | 0.5 | 0.632741 | 0.103560 | 0.944202 | 0.055798 | 0.613896 |
| 2 | 2 | y ~ tempo + loudness + valence + danceability ... | 11 | 0.5 | 0.631471 | 0.094993 | 0.947227 | 0.052773 | 0.611263 |
| 0 | 0 | y ~ 1 | 1 | 0.5 | 0.629496 | 0.000000 | 1.000000 | 0.000000 | 0.500000 |
sns.relplot(data = results_df, x='num_coefs', y='Accuracy', kind='scatter')
plt.show()
sns.relplot(data = results_df, x='num_coefs', y='ROC_AUC', kind='scatter')
plt.show()
Training-set performance improves as the number of coefficients grows. Because all of these metrics are computed on the training set, the largest models may simply be overfitting, so cross-validation is needed for a fair comparison.
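The notebook imports StratifiedKFold but has not used it up to this point; a cross-validated comparison could look like the sketch below. This uses sklearn's LogisticRegression on synthetic stand-in data (the `make_classification` data and the fold count are illustrative, not the track features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the track features and popularity label
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     n_informative=4, random_state=121)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=121)
aucs = []
for train_idx, test_idx in skf.split(X_demo, y_demo):
    mod = LogisticRegression(max_iter=1000).fit(X_demo[train_idx], y_demo[train_idx])
    proba = mod.predict_proba(X_demo[test_idx])[:, 1]
    aucs.append(roc_auc_score(y_demo[test_idx], proba))

# Held-out AUC is the honest estimate; training AUC is optimistic
print(f'mean held-out ROC AUC: {np.mean(aucs):.3f} +/- {np.std(aucs):.3f}')
```

Ranking the candidate formulas by mean held-out ROC AUC rather than training ROC AUC would penalize the 396-coefficient model if it is overfitting.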
y ~ playlist_genre * ( (tempo + loudness + valence + danceability + energy + duration_ms_log + speechiness_log + instrumentalness_log + liveness_log + acousticness_log)**2 + np.power(tempo,2) + np.power(loudness,2) + np.power(valence,2) + np.power(danceability,2) + np.power(energy,2) + np.power(duration_ms_log,2) + np.power(speechiness_log,2) + np.power(instrumentalness_log,2) + np.power(liveness_log,2) + np.power(acousticness_log,2))

def fit_logistic_make_roc(mod_name, a_formula, train_data):
a_mod = smf.logit(formula=a_formula, data=train_data).fit()
train_copy = train_data.copy()
train_copy['pred_probability'] = a_mod.predict( train_data )
fpr, tpr, threshold = roc_curve( train_data.y.to_numpy(), train_copy.pred_probability.to_numpy() )
res_df = pd.DataFrame({'tpr': tpr,
'fpr': fpr,
'threshold': threshold})
res_df['model_name'] = mod_name
res_df['model_formula'] = a_formula
return res_df
roc_list = []
for m in range( len(formula_list) ):
roc_list.append( fit_logistic_make_roc( m, formula_list[m], train_data=df))
Optimization terminated successfully.
Current function value: 0.659223
Iterations 4
Optimization terminated successfully.
Current function value: 0.635441
Iterations 5
Optimization terminated successfully.
Current function value: 0.642106
Iterations 5
Optimization terminated successfully.
Current function value: 0.620351
Iterations 6
Optimization terminated successfully.
Current function value: 0.633988
Iterations 6
Optimization terminated successfully.
Current function value: 0.612935
Iterations 7
Optimization terminated successfully.
Current function value: 0.641314
Iterations 5
Optimization terminated successfully.
Current function value: 0.597308
Iterations 15
roc_df = pd.concat( roc_list, ignore_index=True )
roc_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75759 entries, 0 to 75758
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   tpr            75759 non-null  float64
 1   fpr            75759 non-null  float64
 2   threshold      75759 non-null  float64
 3   model_name     75759 non-null  int64  
 4   model_formula  75759 non-null  object 
dtypes: float64(3), int64(1), object(1)
memory usage: 2.9+ MB
roc_df['model_name'] = roc_df.model_name.astype('category')
roc_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75759 entries, 0 to 75758
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   tpr            75759 non-null  float64 
 1   fpr            75759 non-null  float64 
 2   threshold      75759 non-null  float64 
 3   model_name     75759 non-null  category
 4   model_formula  75759 non-null  object  
dtypes: category(1), float64(3), object(1)
memory usage: 2.4+ MB
sns.relplot(data = roc_df, x='fpr', y='tpr', hue='model_name',
kind='line', estimator=None,)
plt.show()
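The overlaid ROC curves are easier to compare with a single number per model. A minimal sketch of recovering the area under each stored curve with the trapezoid rule; `auc_from_curve` and the `demo` frame here are hypothetical illustrations, not part of the original analysis:

```python
import pandas as pd
from sklearn.metrics import auc

def auc_from_curve(curve_df):
    # integrate the stored (fpr, tpr) points with the trapezoid rule
    curve_df = curve_df.sort_values('fpr')
    return auc(curve_df.fpr.to_numpy(), curve_df.tpr.to_numpy())

# tiny synthetic curve (hypothetical values, not from the models above)
demo = pd.DataFrame({'fpr': [0.0, 0.5, 1.0], 'tpr': [0.0, 0.8, 1.0]})
print(auc_from_curve(demo))
```

Applied per model as `roc_df.groupby('model_name').apply(auc_from_curve)`, this should line up with the ROC_AUC values reported earlier.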
# Model with all features
model_all_feat = smf.logit(formula=formula_list[3], data=df).fit()
Optimization terminated successfully.
Current function value: 0.620351
Iterations 6
print(np.abs(model_all_feat.pvalues[model_all_feat.pvalues < 0.05]).sort_values().to_string())
playlist_genre[T.pop]     2.617667e-184
playlist_genre[T.rock]    1.223407e-158
loudness                  5.572640e-108
playlist_genre[T.rap]      3.086667e-84
playlist_genre[T.latin]    5.867846e-78
energy                     3.397391e-60
playlist_genre[T.r&b]      2.052120e-35
Intercept                  6.860019e-16
acousticness_log           1.010851e-15
duration_ms_log            3.554137e-15
danceability               3.869361e-10
tempo                      1.036601e-06
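The p-values above only tell us which coefficients are distinguishable from zero; exponentiating the coefficients turns them into odds ratios, which are easier to interpret. A minimal sketch on synthetic data (the `demo` frame and its single `loudness` input are hypothetical); the same `np.exp(...)` transform applies to the `params` of the fitted models above:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# synthetic data (hypothetical) with a positive loudness -> popularity relationship
rng = np.random.default_rng(0)
demo = pd.DataFrame({'loudness': rng.normal(size=500)})
demo['y'] = (demo.loudness + rng.normal(size=500) > 0).astype(int)

fit = smf.logit('y ~ loudness', data=demo).fit(disp=0)
# exp(beta) is the multiplicative change in the odds of y=1 per unit increase
odds_ratios = np.exp(fit.params)
print(odds_ratios)
```

An odds ratio above 1 pushes a track toward the Popular class, below 1 toward Unpopular.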
# Best performing model
model_best = smf.logit(formula=formula_list[7], data=df).fit()
Optimization terminated successfully.
Current function value: 0.597308
Iterations 15
print(np.abs(model_best.pvalues[model_best.pvalues < 0.05]).sort_values().to_string())
playlist_genre[T.pop]:tempo:speechiness_log                 0.000038
playlist_genre[T.rap]:tempo:speechiness_log                 0.000046
playlist_genre[T.latin]:tempo:speechiness_log               0.000074
playlist_genre[T.pop]:np.power(loudness, 2)                 0.000110
playlist_genre[T.r&b]:tempo:speechiness_log                 0.000139
playlist_genre[T.rock]:danceability:instrumentalness_log    0.000164
tempo:speechiness_log                                       0.000210
duration_ms_log                                             0.000693
valence:instrumentalness_log                                0.000765
np.power(duration_ms_log, 2)                                0.000943
playlist_genre[T.rock]:tempo:valence                        0.001122
Intercept                                                   0.001265
playlist_genre[T.latin]:valence:instrumentalness_log        0.001375
playlist_genre[T.rock]:loudness:duration_ms_log             0.001454
loudness:duration_ms_log                                    0.001490
loudness                                                    0.001938
playlist_genre[T.rap]:valence:instrumentalness_log          0.001972
playlist_genre[T.rap]:loudness                              0.002020
playlist_genre[T.r&b]:loudness:valence                      0.002022
playlist_genre[T.rap]:valence:speechiness_log               0.002107
playlist_genre[T.rap]:loudness:duration_ms_log              0.002553
playlist_genre[T.latin]:np.power(valence, 2)                0.003008
playlist_genre[T.r&b]:valence:instrumentalness_log          0.003140
playlist_genre[T.pop]:danceability:acousticness_log         0.003765
playlist_genre[T.rap]:tempo:instrumentalness_log            0.004021
danceability:instrumentalness_log                           0.004088
playlist_genre[T.rap]:np.power(valence, 2)                  0.004557
playlist_genre[T.rock]:loudness                             0.005344
playlist_genre[T.rock]:energy:liveness_log                  0.005627
playlist_genre[T.pop]:danceability:instrumentalness_log     0.005915
playlist_genre[T.pop]:loudness:valence                      0.006003
playlist_genre[T.r&b]:tempo:valence                         0.006055
playlist_genre[T.latin]:loudness:duration_ms_log            0.006093
playlist_genre[T.rap]:danceability:instrumentalness_log     0.006162
playlist_genre[T.pop]:valence:instrumentalness_log          0.006614
playlist_genre[T.latin]:loudness                            0.007081
playlist_genre[T.rock]:energy:acousticness_log              0.007481
playlist_genre[T.rock]:valence:instrumentalness_log         0.007618
playlist_genre[T.pop]:tempo:valence                         0.007925
playlist_genre[T.rock]:danceability:acousticness_log        0.009008
playlist_genre[T.r&b]:loudness:danceability                 0.009556
playlist_genre[T.latin]:duration_ms_log                     0.009633
loudness:valence                                            0.010414
np.power(loudness, 2)                                       0.011153
playlist_genre[T.rock]:np.power(loudness, 2)                0.011622
playlist_genre[T.pop]:loudness:energy                       0.011790
playlist_genre[T.latin]:np.power(loudness, 2)               0.012232
playlist_genre[T.latin]                                     0.012817
tempo:valence                                               0.013268
playlist_genre[T.r&b]:tempo:liveness_log                    0.013303
playlist_genre[T.r&b]:np.power(loudness, 2)                 0.013421
playlist_genre[T.latin]:loudness:danceability               0.013539
playlist_genre[T.rock]:tempo:speechiness_log                0.013636
playlist_genre[T.rap]:valence:energy                        0.015049
playlist_genre[T.latin]:np.power(duration_ms_log, 2)        0.015506
playlist_genre[T.r&b]:energy:liveness_log                   0.015545
playlist_genre[T.latin]:danceability:energy                 0.016139
valence:speechiness_log                                     0.017021
danceability:acousticness_log                               0.019663
energy:acousticness_log                                     0.020762
liveness_log:acousticness_log                               0.021000
playlist_genre[T.latin]:energy:instrumentalness_log         0.022695
playlist_genre[T.latin]:loudness:acousticness_log           0.022900
playlist_genre[T.rap]:tempo:valence                         0.023506
playlist_genre[T.rock]:loudness:acousticness_log            0.024482
playlist_genre[T.rap]:danceability:acousticness_log         0.024661
playlist_genre[T.rap]:loudness:valence                      0.024715
playlist_genre[T.r&b]:danceability:instrumentalness_log     0.030629
tempo:instrumentalness_log                                  0.031795
playlist_genre[T.pop]:valence:speechiness_log               0.033448
playlist_genre[T.r&b]:liveness_log:acousticness_log         0.037329
playlist_genre[T.latin]:liveness_log:acousticness_log       0.038067
playlist_genre[T.r&b]:valence:speechiness_log               0.038762
playlist_genre[T.rap]:np.power(loudness, 2)                 0.039736
playlist_genre[T.latin]:energy:acousticness_log             0.042226
playlist_genre[T.r&b]:loudness:duration_ms_log              0.042251
playlist_genre[T.rock]:liveness_log:acousticness_log        0.044431
loudness:danceability                                       0.044651
playlist_genre[T.rock]:valence:speechiness_log              0.045373
playlist_genre[T.latin]:tempo:valence                       0.046126
energy:liveness_log                                         0.047472
playlist_genre[T.pop]:speechiness_log                       0.047958
🛸 loudness, duration_ms_log, and playlist_genre are the most influential features
input_grid = pd.DataFrame([(playlist_genre, danceability, energy, key, loudness, mode, valence, tempo, speechiness_log, acousticness_log, instrumentalness_log, liveness_log, duration_ms_log)
for playlist_genre in df.playlist_genre.unique()
for danceability in [df.danceability.mean()]
for energy in [df.energy.mean()]
for key in df.key.mode()
for loudness in np.linspace(df.loudness.min(), df.loudness.max(), num=5)
for mode in df['mode'].mode()
for valence in [df.valence.mean()]
for tempo in [df.tempo.mean()]
for speechiness_log in [df.speechiness_log.mean()]
for acousticness_log in [df.acousticness_log.mean()]
for instrumentalness_log in [df.instrumentalness_log.mean()]
for liveness_log in [df.liveness_log.mean()]
for duration_ms_log in np.linspace(df.duration_ms_log.min(), df.duration_ms_log.max(), num=101)
],
columns=['playlist_genre', 'danceability', 'energy', 'key', 'loudness', 'mode', 'valence', 'tempo', 'speechiness_log', 'acousticness_log', 'instrumentalness_log', 'liveness_log', 'duration_ms_log'])
input_grid.nunique()
playlist_genre            6
danceability              1
energy                    1
key                       1
loudness                  5
mode                      1
valence                   1
tempo                     1
speechiness_log           1
acousticness_log          1
instrumentalness_log      1
liveness_log              1
duration_ms_log         101
dtype: int64
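The nested comprehension above builds a full-factorial grid, so the row count is the product of the per-column unique counts (6 genres x 5 loudness levels x 101 durations = 3030 rows). A compact equivalent sketch using `itertools.product`; the genre labels and value ranges here are placeholders, not the real data:

```python
import numpy as np
import pandas as pd
from itertools import product

# placeholder levels mirroring the grid above: 6 genres x 5 loudness x 101 durations
genres = ['pop', 'rap', 'rock', 'latin', 'r&b', 'edm']
loudness_levels = np.linspace(-40.0, 0.0, num=5)
duration_levels = np.linspace(10.0, 13.0, num=101)

# every combination of the three varying inputs, one row per combination
grid = pd.DataFrame(list(product(genres, loudness_levels, duration_levels)),
                    columns=['playlist_genre', 'loudness', 'duration_ms_log'])
print(len(grid))  # 6 * 5 * 101 = 3030
```

Columns held at a single value (means and modes above) do not multiply the row count, which keeps the grid small enough to predict on cheaply.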
dfviz = input_grid.copy()
dfviz['pred_probability_all'] = model_all_feat.predict(input_grid)
dfviz['pred_probability_best'] = model_best.predict(input_grid)
g = sns.relplot(data = dfviz, x='duration_ms_log', y='pred_probability_all',
hue='loudness',
col='playlist_genre',
kind='line',
palette='vlag', estimator=None, units='loudness')
plt.show()
👉 The above plot shows that, regardless of genre, shorter and louder songs tend to be more popular
g= sns.relplot(data = dfviz, x='duration_ms_log', y='pred_probability_best',
hue='loudness',
col='playlist_genre',
kind='line',
palette='vlag', estimator=None, units='loudness')
plt.show()
👉 Loudness consistently boosts predicted probability
👉 Different genres peak at different durations, suggesting that the ideal track length, and how it interacts with loudness, varies widely by genre.
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=101)
kf.get_n_splits()
5
input_names = df.drop(columns=['y']).copy().columns.to_list()
input_names
['playlist_genre', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms', 'track_popularity_bin', 'release_year_bin', 'speechiness_log', 'acousticness_log', 'instrumentalness_log', 'liveness_log', 'duration_ms_log']
output_name = 'y'
def train_and_test_logistic_with_cv(mod_name, a_formula, data_df, x_names, y_name, cv, threshold=0.5):
# separate the inputs and output
input_df = data_df.loc[ :, x_names ].copy()
# initialize the performance metric storage lists
train_res = []
test_res = []
# SPLIT the data and iterate over the folds
for train_id, test_id in cv.split( input_df.to_numpy(), data_df[ y_name ].to_numpy() ):
# subset the training and test splits within each fold
train_data = data_df.iloc[ train_id, :].copy()
test_data = data_df.iloc[ test_id, :].copy()
# FIT the model on the TRAINING data within the current FOLD
a_mod = smf.logit(formula=a_formula, data=train_data).fit()
# PREDICT the TRAINING within each FOLD
train_copy = train_data.copy()
train_copy['pred_probability'] = a_mod.predict( train_data )
train_copy['pred_class'] = np.where( train_copy.pred_probability > threshold, 1, 0 )
# PREDICT the TESTING within each FOLD
test_copy = test_data.copy()
test_copy['pred_probability'] = a_mod.predict( test_data )
test_copy['pred_class'] = np.where( test_copy.pred_probability > threshold, 1, 0)
# calculate the PERFORMANCE METRIC on the TRAINING SET within the FOLD
train_res.append( np.mean( train_copy[ y_name ] == train_copy.pred_class ) )
# calculate the PERFORMANCE METRIC on the TESTING SET within the FOLD
test_res.append( np.mean( test_copy[ y_name ] == test_copy.pred_class ) )
# book keeping to store the results
train_df = pd.DataFrame({'Accuracy': train_res})
train_df['from_set'] = 'training'
train_df['fold_id'] = train_df.index + 1
test_df = pd.DataFrame({'Accuracy': test_res})
test_df['from_set'] = 'testing'
test_df['fold_id'] = test_df.index + 1
# combine the splits together
res_df = pd.concat([train_df, test_df], ignore_index=True)
# add information about the model
res_df['model_name'] = mod_name
res_df['model_formula'] = a_formula
res_df['num_coefs'] = len( a_mod.params )
res_df['threshold'] = threshold
return res_df
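Accuracy at a fixed 0.5 threshold is only one view of fold performance; a threshold-free metric like ROC AUC can be computed inside the same kind of loop. A minimal sketch on synthetic data, assuming the same StratifiedKFold settings (the `demo` frame and its single input `x` are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# synthetic binary-classification data (hypothetical)
rng = np.random.default_rng(101)
demo = pd.DataFrame({'x': rng.normal(size=400)})
demo['y'] = (demo.x + rng.normal(size=400) > 0).astype(int)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=101)
aucs = []
for train_id, test_id in cv.split(demo[['x']].to_numpy(), demo.y.to_numpy()):
    # fit on the training fold, score the held-out fold with ROC AUC
    fold_fit = smf.logit('y ~ x', data=demo.iloc[train_id]).fit(disp=0)
    pred_prob = fold_fit.predict(demo.iloc[test_id])
    aucs.append(roc_auc_score(demo.iloc[test_id].y.to_numpy(), pred_prob.to_numpy()))
print(np.mean(aucs))
```

Because ROC AUC does not depend on the threshold argument, it makes a useful companion metric to the accuracy stored by `train_and_test_logistic_with_cv`.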
results_list = []
for m in range( len(formula_list) ):
print( '-- Formula ID %d --' % m )
try:
results_list.append( train_and_test_logistic_with_cv(m, formula_list[m], data_df=df, x_names=input_names, y_name=output_name, cv=kf) )
    except Exception:
        # report which formula failed rather than silently swallowing the error
        print( ' !!! Formula ID %d could NOT fit !!! ' % m)
-- Formula ID 0 --
Optimization terminated successfully.
Current function value: 0.659211
Iterations 4
Optimization terminated successfully.
Current function value: 0.659226
Iterations 4
Optimization terminated successfully.
Current function value: 0.659226
Iterations 4
Optimization terminated successfully.
Current function value: 0.659226
Iterations 4
Optimization terminated successfully.
Current function value: 0.659226
Iterations 4
-- Formula ID 1 --
Optimization terminated successfully.
Current function value: 0.635259
Iterations 5
Optimization terminated successfully.
Current function value: 0.635084
Iterations 5
Optimization terminated successfully.
Current function value: 0.634771
Iterations 5
Optimization terminated successfully.
Current function value: 0.636872
Iterations 5
Optimization terminated successfully.
Current function value: 0.635024
Iterations 5
-- Formula ID 2 --
Optimization terminated successfully.
Current function value: 0.641017
Iterations 5
Optimization terminated successfully.
Current function value: 0.642256
Iterations 5
Optimization terminated successfully.
Current function value: 0.642036
Iterations 5
Optimization terminated successfully.
Current function value: 0.642304
Iterations 5
Optimization terminated successfully.
Current function value: 0.642701
Iterations 5
-- Formula ID 3 --
Optimization terminated successfully.
Current function value: 0.619315
Iterations 6
Optimization terminated successfully.
Current function value: 0.620069
Iterations 6
Optimization terminated successfully.
Current function value: 0.619864
Iterations 6
Optimization terminated successfully.
Current function value: 0.621844
Iterations 5
Optimization terminated successfully.
Current function value: 0.620260
Iterations 6
-- Formula ID 4 --
Optimization terminated successfully.
Current function value: 0.632905
Iterations 6
Optimization terminated successfully.
Current function value: 0.633580
Iterations 6
Optimization terminated successfully.
Current function value: 0.633978
Iterations 6
Optimization terminated successfully.
Current function value: 0.634078
Iterations 6
Optimization terminated successfully.
Current function value: 0.634191
Iterations 6
-- Formula ID 5 --
Optimization terminated successfully.
Current function value: 0.611040
Iterations 7
Optimization terminated successfully.
Current function value: 0.612808
Iterations 7
Optimization terminated successfully.
Current function value: 0.612398
Iterations 7
Optimization terminated successfully.
Current function value: 0.614464
Iterations 7
Optimization terminated successfully.
Current function value: 0.612425
Iterations 7
-- Formula ID 6 --
Optimization terminated successfully.
Current function value: 0.639934
Iterations 5
Optimization terminated successfully.
Current function value: 0.641590
Iterations 5
Optimization terminated successfully.
Current function value: 0.641243
Iterations 5
Optimization terminated successfully.
Current function value: 0.641518
Iterations 5
Optimization terminated successfully.
Current function value: 0.641853
Iterations 5
-- Formula ID 7 --
Optimization terminated successfully.
Current function value: 0.593297
Iterations 16
Optimization terminated successfully.
Current function value: 0.596232
Iterations 16
Optimization terminated successfully.
Current function value: 0.595761
Iterations 16
Optimization terminated successfully.
Current function value: 0.596898
Iterations 16
Optimization terminated successfully.
Current function value: 0.595371
Iterations 15
cv_results = pd.concat(results_list, ignore_index=True )
cv_results.model_name.value_counts()
model_name
0    10
1    10
2    10
3    10
4    10
5    10
6    10
7    10
Name: count, dtype: int64
cv_results['model_name'] = cv_results['model_name'].astype('category')
sns.catplot(data = cv_results, x='model_name', hue='from_set', kind='count')
plt.show()
sns.catplot(data = cv_results.loc[ cv_results.from_set == 'testing', :],
x='model_name', y='Accuracy', kind='point', linestyle='none',
errorbar=('ci', 68))
plt.show()
sns.catplot(data = cv_results.loc[ cv_results.from_set == 'testing', :],
x='num_coefs', y='Accuracy', kind='point', linestyle='none',
hue='model_name',
errorbar=('ci', 68))
plt.show()
🌟 Based on the above POINT PLOT, Model 5 is the new BEST model: it has far fewer coefficients, and its test accuracy is within the margin of error of the previous best model, Model 7 🌟
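The "within the margin of error" argument above is essentially the one-standard-error rule: among models whose mean test accuracy is within one standard error of the best, prefer the one with the fewest coefficients. A sketch with hypothetical fold accuracies and coefficient counts (illustration only, not the real cv_results values):

```python
import pandas as pd

# hypothetical fold accuracies for three candidate models (illustration only)
demo = pd.DataFrame({
    'model_name': [5]*5 + [7]*5 + [0]*5,
    'num_coefs':  [20]*5 + [200]*5 + [1]*5,
    'Accuracy':   [0.659, 0.656, 0.657, 0.658, 0.657,
                   0.660, 0.657, 0.655, 0.661, 0.658,
                   0.630, 0.629, 0.631, 0.628, 0.630],
})
summary = (demo.groupby('model_name')
               .agg(mean_acc=('Accuracy', 'mean'),
                    se_acc=('Accuracy', 'sem'),
                    num_coefs=('num_coefs', 'first')))
# cutoff: best mean accuracy minus one standard error of the best model
best_id = summary.mean_acc.idxmax()
cutoff = summary.mean_acc.max() - summary.loc[best_id, 'se_acc']
# smallest model whose mean accuracy clears the cutoff
pick = summary[summary.mean_acc >= cutoff].sort_values('num_coefs').index[0]
print(pick)
```

Here the smaller model 5 clears the cutoff, so it is preferred over the marginally more accurate but much larger model 7, mirroring the reasoning from the point plot.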
best_model = smf.logit( formula=formula_list[5], data=df).fit()
Optimization terminated successfully.
Current function value: 0.612935
Iterations 7
dfviz['pred_probability'] = best_model.predict( input_grid )
g= sns.relplot(data = dfviz, x='duration_ms_log', y='pred_probability',
hue='loudness',
col='playlist_genre',
kind='line',
palette='vlag', estimator=None, units='loudness')
plt.show()
👉 Predicted probability depends on both loudness and duration_ms_log
👉 Loudness consistently boosts predicted probability
👉 The effect of duration on predicted probability differs based on genre